BEAM-3644: Speed up Python DirectRunner execution by using the FnApiRunner when possible

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 2.2.0, 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: sdk-py-core
    • Labels: None

    Description

      Local execution of Beam pipelines on the Python DirectRunner currently suffers from performance issues, which makes it hard for pipeline authors to iterate, especially on medium-to-large datasets. We would like to optimize this and make local execution a better experience for Beam users.

      The FnApiRunner was written to exercise the portability framework's execution code path locally, as an aid to portability development. We've found that it also offers large speedups in batch execution, so we propose switching the DirectRunner to use it for batch pipelines. For example, WordCount on the Shakespeare dataset with a single CPU core now takes 50 seconds to run, compared to 12 minutes before: a roughly 15x performance improvement that users get for free, with no pipeline changes.
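
To compare the two code paths locally, something like the following sketch may help. It assumes a Beam Python SDK in which `FnApiRunner` is importable from `apache_beam.runners.portability.fn_api_runner`; the helper names `extract_words` and `run_wordcount` and the one-line input are made up for illustration and are not taken from this issue or from the WordCount benchmark mentioned above.

```python
import re


def extract_words(line):
    """Split a line into words; hypothetical tokenizer for this sketch."""
    return re.findall(r"[A-Za-z']+", line)


def run_wordcount(lines, runner=None):
    """Count words with Beam.

    runner=None lets Beam choose its default local runner; passing a
    runner instance selects that runner explicitly.
    """
    # Imported lazily so extract_words stays usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline(runner=runner) as p:
        (p
         | "Create" >> beam.Create(lines)
         | "Split" >> beam.FlatMap(extract_words)
         | "Pair" >> beam.Map(lambda word: (word, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Print" >> beam.Map(print))


if __name__ == "__main__":
    # Assumption: this import path is available in the installed SDK.
    from apache_beam.runners.portability import fn_api_runner

    lines = ["to be or not to be"]
    run_wordcount(lines)  # default local runner
    run_wordcount(lines, runner=fn_api_runner.FnApiRunner())  # explicit FnApiRunner
```

Timing the two `run_wordcount` calls on a larger input is one way to check the speedup on a given machine. The pipeline definition is identical in both calls, which matches the "no pipeline changes" point above.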

          People

            Assignee: Charles Chen (ccy)
            Reporter: Charles Chen (ccy)
            Votes: 0
            Watchers: 4

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 2h 40m